Programming Massively Parallel Processors: A Hands-on Approach: The Birth of GPU Computing

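The GPU was born of the "Real-Time Imperative": the absolute requirement to render a complex 3D scene within 1/60 of a second (16.67 milliseconds). CPUs, by contrast, followed a multicore trajectory optimized for low-latency serial execution, and they ran into performance limits as display resolutions increased.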
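As a quick check of the arithmetic, the frame budget is simply 1000 ms divided by the target frame rate. The following is a minimal Python sketch; the function name `frame_budget_ms` and the extra frame rates are chosen for illustration only.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available to render one frame, in milliseconds."""
    return 1000.0 / fps


# Compare a few target frame rates (illustrative values).
for fps in (30, 60, 100):
    print(f"{fps:>3} FPS -> {frame_budget_ms(fps):.2f} ms per frame")
```

At 60 FPS this gives 16.67 ms. Halving the frame rate to 30 FPS doubles the budget to 33.33 ms.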

1. The 16.67 ms Constraint

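In the mid-1990s, the gaming industry faced a crisis. Single-threaded CPUs, which also had to handle AI and physics, could not compute millions of pixel values fast enough to sustain smooth motion. This pressure led to the development of dedicated hardware for the repetitive work of the graphics pipeline.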
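To see why pixel counts mattered, one can divide the 16.67 ms budget by the number of pixels in a frame. The sketch below assumes each pixel is computed exactly once per frame and uses a hypothetical 100 MHz serial processor; the resolutions, the clock rate, and the function names are illustrative assumptions, not figures from the text.

```python
def per_pixel_budget_ns(width: int, height: int, fps: float) -> float:
    """Average time available per pixel, assuming each pixel is computed once per frame."""
    frame_ns = 1e9 / fps
    return frame_ns / (width * height)


def cycles_per_pixel(width: int, height: int, fps: float, clock_hz: float) -> float:
    """Clock cycles a single serial processor could spend on each pixel."""
    return per_pixel_budget_ns(width, height, fps) * clock_hz / 1e9


HYPOTHETICAL_CLOCK_HZ = 100e6  # assumed 100 MHz serial core, for illustration only

for width, height in ((320, 240), (640, 480), (1024, 768)):
    ns = per_pixel_budget_ns(width, height, 60)
    cycles = cycles_per_pixel(width, height, 60, HYPOTHETICAL_CLOCK_HZ)
    print(f"{width}x{height}: {ns:6.1f} ns/pixel, about {cycles:4.1f} cycles/pixel")
```

Under these assumptions, 640x480 at 60 FPS leaves roughly 54 ns (about 5 cycles) per pixel before any AI or physics work is counted, and the per-pixel budget shrinks in proportion as resolution grows.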

2. Scan-Line Interleaving (SLI)

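Before parallel processing arrays were integrated inside a single chip, 3dfx introduced Scan-Line Interleaving (SLI). By having two physical graphics cards compute alternating horizontal lines, SLI helped shift the industry's focus from single-thread speed to brute-force parallelism.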
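The interleaving idea can be modeled as a fixed assignment of scan lines to cards: line `y` goes to card `y % num_cards`. This is a simplified Python sketch of the work split only; `assign_scanlines` is a hypothetical helper and does not reflect 3dfx's actual hardware or driver logic.

```python
def assign_scanlines(height: int, num_cards: int = 2) -> dict[int, list[int]]:
    """Map each card to the scan lines it renders: line y goes to card y % num_cards."""
    assignment: dict[int, list[int]] = {card: [] for card in range(num_cards)}
    for y in range(height):
        assignment[y % num_cards].append(y)
    return assignment


print(assign_scanlines(8))  # {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
```

With two cards, each renders half the lines of every frame, so in this idealized model each card has twice as much time per line within the same 16.67 ms window.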

3. Throughput vs. Latency

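From the start, GPU designs devoted more silicon area to simple arithmetic units than to complex branch prediction. This "wide and slow" philosophy let the GPU handle the repetitive math of processing triangles, while the CPU concentrated on logic that could not be parallelized.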
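A toy model makes the trade-off concrete: finishing `n` independent operations on `units` identical units takes `ceil(n / units)` rounds, each lasting one operation's latency. The latencies and unit counts below are hypothetical, and the model ignores pipelining, memory, and scheduling overheads.

```python
import math


def completion_time_ns(num_items: int, units: int, latency_ns: float) -> float:
    """Time to finish num_items independent operations on `units` identical units,
    each taking latency_ns per operation (no pipelining, no overheads)."""
    rounds = math.ceil(num_items / units)
    return rounds * latency_ns


# Hypothetical designs; the numbers are illustrative assumptions.
LATENCY_CORE = dict(units=1, latency_ns=1.0)  # one fast, complex core
WIDE_AND_SLOW = dict(units=256, latency_ns=8.0)  # many simple, slower units

for n in (1, 1_000, 1_000_000):
    t_latency = completion_time_ns(n, **LATENCY_CORE)
    t_wide = completion_time_ns(n, **WIDE_AND_SLOW)
    print(f"n={n:>9}: latency-oriented {t_latency:>10.0f} ns, wide-and-slow {t_wide:>8.0f} ns")
```

Under these assumptions the single fast core wins for one operation (1 ns vs. 8 ns), but the wide design finishes a million independent operations in about 31 microseconds versus 1 ms.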


QUESTION 1

What is the specific 'time budget' required for 60 frames per second (FPS)?

33.33ms

16.67ms

10.00ms

100.00ms

QUESTION 2

How did 3dfx's SLI achieve early parallelism in consumer hardware?

By increasing the clock speed of a single chip.

By having two cards render alternating horizontal scan lines.

By sharing AI logic between the GPU and CPU.

By reducing the resolution of the frame.

QUESTION 3

Why did the GPU diverge from the standard multicore trajectory of CPUs?

GPUs needed deeper caches for complex branching.

GPUs prioritize throughput of simple math over low-latency serial logic.

CPUs became too expensive to manufacture for 3D graphics.

GPU architectures were designed to be smaller than CPUs.

QUESTION 4

In the context of 1990s gaming, what was the 'Real-Time Imperative'?

The requirement to run physics simulations on the GPU.

Processing millions of pixels within the strict frame window.

The transition from 16-bit to 32-bit computing.

Allowing the CPU to handle rasterization.

QUESTION 5

What is meant by the GPU's 'Wide and Slow' philosophy?

Using many simple processors at lower clock speeds to do massive work.

Designing physically wide chips that take longer to process data.

A design that favors high latency but high memory capacity.

Optimizing for single-threaded serial logic.